ABSTRACT

In recent years, multilayer perceptrons (MLPs) with many hidden layers, known as deep neural networks (DNNs), have performed surprisingly well in many speech tasks, e.g. speech recognition, speaker verification and speech synthesis. In the context of F0 modeling, however, these techniques have not been exploited properly. In this paper, a Deep Belief Network (DBN), a member of the DNN family, is applied to model the F0 contour of synthesized speech generated by an HMM-based speech synthesis system. The experiment was done on the Bengali language. Several DBN-DNN architectures, ranging from four to seven hidden layers with up to 200 hidden units per hidden layer, are presented and evaluated. The results are compared against the clustering tree techniques commonly found in statistical parametric speech synthesis. We show that, from textual inputs, the DBN-DNN learns a high-level structure which in turn improves the F0 contour in terms of objective and subjective tests.

Index Terms: F0 Modeling, DBN, Speech Synthesis, Bengali.

1. INTRODUCTION 

Prosody plays the most important role in generating natural and intelligible speech. Prosody is a collection of suprasegmental features (duration, intonation, co-articulation pattern) which contribute additional information to speech that is not found in text. Within prosody, the F0 contour makes a significant contribution that is crucial to human speech perception. Knowledge of these parameters is implicit in the speech signal, i.e. it is hard to capture the rules governing this knowledge. Traditional shallow architectures (i.e. statistical models with few levels of computation units) are fine in a limited domain, but they are not capable of handling these parameters when there is a lot of variation in the test set. In this context, DNNs are known for their ability to capture internal representations that become increasingly complex.

In recent years there has been a lot of interest in applying DNNs to different speech processing tasks. The main advantage of these deep learning techniques is that they can learn from unlabeled data, and only a limited number of labeled examples are needed to fine-tune the model for the specific task at hand. DNNs are basically multilayer perceptrons with many hidden layers. The DBN is a member of the DNN family. It is a probabilistic generative model with multiple layers of stochastic hidden variables. Each pair of layers is treated as a Restricted Boltzmann Machine (RBM), which is a bipartite undirected graphical model with a two-layer architecture. The training of a DBN, as described in [?], first initializes the weights of each layer greedily in a purely unsupervised way and then fine-tunes all the weights jointly to further improve the likelihood. The resulting DBN is considered a hierarchy of non-linear feature detectors that can learn complex statistical patterns. Rather than initializing a DNN with random weights, the weights learned by the DBN can be used as the weights of the DNN. This is commonly called pre-training of the DNN. The whole DNN can be further fine-tuned with a small amount of labeled training data. DBN-DNN (a DNN pre-trained by a DBN and fine-tuned with labeled data) has been successfully applied to speech, audio, image and text data [?]. These advances triggered interest in applying deep learning techniques to speech synthesis tasks. In recent years, DBNs have been successfully applied to modeling speech signals, such as spectrogram coding [?], speech recognition [?], and acoustic-articulatory inversion mapping [?], where they mainly act as the pre-training method for a deep autoencoder or a deep neural network (DNN). In the statistical parametric speech synthesis domain, DBNs have also been studied very recently [?].

The use of a DBN to model the F0 contour [?] is not new. In [?], however, the DBN was used as a feature extractor for Gaussian Process Regression, which is a non-parametric model. Our work differs in that our prediction model is based on a DNN with weights initialized by a DBN. This makes the proposed model completely parametric, which has advantages such as a smaller footprint compared with a non-parametric model. As a result, the proposed method could be used on hand-held devices as a stand-alone application. Another difference is that, rather than a discontinuous F0 contour [?], we train the DNN with a continuous contour, which adds simplicity to the model.

This paper is organized as follows. In Section 2, we briefly review the basic techniques of RBMs and DBNs. In Section 3, we describe the details of our proposed method. Section 4 reports our experimental results. Section 5 gives the conclusion and a discussion of our future work.

2. ARCHITECTURE OF DEEP BELIEF NETWORK
2.1. Restricted Boltzmann Machines 


An RBM is a special type of Markov random field that has one layer of (Bernoulli) stochastic hidden units and one layer of (Bernoulli or Gaussian) stochastic visible or observable units. There are no visible-visible or hidden-hidden connections, but all the visible units are connected to all the hidden units. The weights on the connections between the visible units $\mathbf{v}$ and the hidden units $\mathbf{h}$ define a probability distribution over the visible units $\mathbf{v}$ via an energy function [?]. Depending on the type of visible unit (Bernoulli or Gaussian), there are two types of energy function for the joint configuration $(\mathbf{v}, \mathbf{h})$: Bernoulli (visible)-Bernoulli (hidden) and Gaussian (visible)-Bernoulli (hidden). Gaussian-Bernoulli RBMs are used to convert real-valued stochastic variables into binary stochastic variables, which can then be further processed using Bernoulli-Bernoulli RBMs. In this work the Bernoulli-Bernoulli RBM is used, and its energy function is defined as

$$E(\mathbf{v},\mathbf{h};\theta) = -\sum_{i=1}^{V}\sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{i=1}^{V} b_i v_i - \sum_{j=1}^{H} a_j h_j \qquad (1)$$

where $\theta = (w, b, a)$, $w_{ij}$ represents the symmetric interaction term between visible unit $v_i$ and hidden unit $h_j$, $b_i$ and $a_j$ are the bias terms, and $V$ and $H$ are the numbers of visible and hidden units. The joint distribution $p(\mathbf{v},\mathbf{h};\theta)$ over the visible units $\mathbf{v}$ and hidden units $\mathbf{h}$, given the model parameters $\theta$, is defined in terms of the energy function $E(\mathbf{v},\mathbf{h};\theta)$ as

$$p(\mathbf{v},\mathbf{h};\theta) = \frac{\exp(-E(\mathbf{v},\mathbf{h};\theta))}{Z} \qquad (2)$$

where $Z = \sum_{\mathbf{v}}\sum_{\mathbf{h}} \exp(-E(\mathbf{v},\mathbf{h};\theta))$ is a normalization factor. The marginal probability that the model assigns to a visible vector $\mathbf{v}$ is

$$p(\mathbf{v};\theta) = \frac{\sum_{\mathbf{h}} \exp(-E(\mathbf{v},\mathbf{h};\theta))}{Z} \qquad (3)$$

As there are no hidden-hidden or visible-visible connections, the conditional distributions are factorial and are given by

$$p(h_j = 1 \mid \mathbf{v};\theta) = \sigma\Big(\sum_{i=1}^{V} w_{ij} v_i + a_j\Big) \qquad (4)$$

$$p(v_i = 1 \mid \mathbf{h};\theta) = \sigma\Big(\sum_{j=1}^{H} w_{ij} h_j + b_i\Big) \qquad (5)$$

where $\sigma(x)$ is the activation function. Here $\sigma(x) = 1/(1 + \exp(-x))$ was considered. Applying stochastic gradient descent to the negative log likelihood $l$, the update rule for the RBM weights is

$$\Delta w_{ij}(t+1) = m\,\Delta w_{ij}(t) - \alpha \frac{\partial l}{\partial w_{ij}} \qquad (6)$$

where $\alpha$ is the learning rate and $m$ is the momentum factor used to smooth out the weight updates. The general form of the derivative of the log likelihood of the data can be written as

$$-\frac{\partial l}{\partial w_{ij}} = E_{\text{data}}(v_i, h_j) - E_{\text{model}}(v_i, h_j) \qquad (7)$$

where $E_{\text{data}}(v_i, h_j)$ is the expectation observed in the training set and $E_{\text{model}}(v_i, h_j)$ is that same expectation under the distribution defined by the model. However, $E_{\text{model}}(v_i, h_j)$ is computationally very expensive to compute, so the contrastive divergence (CD) approximation [?] to the gradient is used, where $E_{\text{model}}(v_i, h_j)$ is replaced by running the Gibbs sampler initialized at the data for one full step.
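
As an illustration of the contrastive divergence idea, the following is a minimal NumPy sketch of a single CD-1 weight update for a Bernoulli-Bernoulli RBM. It is not the implementation used in this work: the function name `cd1_step`, the array layout and the omission of bias updates are assumptions made for brevity, and the model expectation is approximated by statistics gathered after one full Gibbs step started at the data.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, dW_prev, rng, lr=0.002, momentum=0.95):
    """One CD-1 weight update on a mini-batch (hypothetical helper).

    v0: (batch, V) binary visible data, W: (V, H) weights,
    a: (H,) hidden biases, b: (V,) visible biases.
    Bias updates are omitted here to keep the sketch short.
    """
    # Positive phase: p(h = 1 | v0) and a binary sample of the hidden units.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(v0.dtype)
    # One full Gibbs step: sample the visibles, then recompute the hiddens.
    pv1 = sigmoid(h0 @ W.T + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(v0.dtype)
    ph1 = sigmoid(v1 @ W + a)
    # Data statistics minus one-step statistics, averaged over the batch.
    grad = (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    # Momentum-smoothed step that increases the log likelihood.
    dW = momentum * dW_prev + lr * grad
    return W + dW, dW

# Toy usage with random binary data (sizes chosen only for illustration).
rng = np.random.default_rng(0)
V, H = 220, 120
W = 0.01 * rng.standard_normal((V, H))
batch = (rng.random((10, V)) < 0.1).astype(np.float64)
W, dW = cd1_step(batch, W, np.zeros(H), np.zeros(V), np.zeros_like(W), rng)
```

Using hidden probabilities rather than samples in the outer products is a common variance-reduction choice and is likewise an assumption of this sketch.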


2.2. Deep Belief Network 

A DBN is formed by stacking a number of RBMs learned layer by layer from the bottom up. In this model, each layer captures the correlations among the activities of hidden features in the layer below. The top two layers of the DBN form an undirected bipartite graph. The lower layers form a directed graph with a top-down direction to generate the visible units. Given the training samples of the visible units, it is difficult to estimate the model parameters of a DBN directly under the maximum likelihood criterion due to the complex model structure with multiple hidden layers. Therefore, a greedy learning algorithm has been proposed and widely applied to train the DBN in a layer-by-layer manner [?]. After learning a Bernoulli-Bernoulli RBM, the activation probabilities of its hidden units are treated as the data for training the Bernoulli-Bernoulli RBM one layer up. The activation probabilities of the 2nd-layer Bernoulli-Bernoulli RBM are then used as the visible data input for the 3rd-layer Bernoulli-Bernoulli RBM, and so on. This greedy procedure achieves approximate maximum likelihood learning. It has been proved that this greedy learning algorithm can improve the lower bound on the log-likelihood of the training samples by adding each new hidden layer [?]. AIS-based partition function estimation with approximate inference [?] is used to estimate the lower bound on the log-probability.
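
The greedy layer-by-layer procedure can be sketched as follows. This is an illustrative NumPy outline under stated assumptions rather than the code used in the experiments: `train_rbm` and `pretrain_dbn` are hypothetical names, the RBM trainer uses CD-1 with a mean-field reconstruction, and the default hyperparameters simply mirror the values reported in Section 3.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=50, batch_size=10,
              lr=0.002, momentum=0.95):
    """Train one Bernoulli-Bernoulli RBM with CD-1; returns (W, hidden_bias)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_hidden)    # hidden biases
    b = np.zeros(n_visible)   # visible biases
    dW, da, db = np.zeros_like(W), np.zeros_like(a), np.zeros_like(b)
    for _ in range(epochs):
        for start in range(0, data.shape[0], batch_size):
            v0 = data[start:start + batch_size]
            ph0 = sigmoid(v0 @ W + a)
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            v1 = sigmoid(h0 @ W.T + b)   # mean-field reconstruction
            ph1 = sigmoid(v1 @ W + a)
            n = v0.shape[0]
            dW = momentum * dW + lr * (v0.T @ ph0 - v1.T @ ph1) / n
            da = momentum * da + lr * (ph0 - ph1).mean(axis=0)
            db = momentum * db + lr * (v0 - v1).mean(axis=0)
            W += dW
            a += da
            b += db
    return W, a

def pretrain_dbn(data, layer_sizes, rng, **rbm_kwargs):
    """Greedy stacking: each RBM is trained on the hidden activation
    probabilities produced by the RBM below it."""
    params, layer_input = [], data
    for n_hidden in layer_sizes:
        W, a = train_rbm(layer_input, n_hidden, rng, **rbm_kwargs)
        params.append((W, a))
        layer_input = sigmoid(layer_input @ W + a)
    return params

# Toy usage: random binary inputs, three hidden layers, two epochs per layer.
rng = np.random.default_rng(0)
toy = (rng.random((40, 220)) < 0.1).astype(float)
params = pretrain_dbn(toy, [120, 120, 120], rng, epochs=2)
```

The returned list of `(W, hidden_bias)` pairs is the kind of parameter set that can initialize the hidden layers of a DNN before supervised fine-tuning.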


3. PROPOSED F0 MODELING APPROACH

As shown in Figure 1, a database of speech and corresponding text sentences was used as the training corpus. The CRBLP speech corpus [?] is employed for the whole experiment; it consists of the voice of a single male speaker aged 27. STRAIGHT [?], a high-quality analysis and synthesis algorithm, was adopted to estimate the spectrum and F0 contours with a 10 ms frame rate. A continuous F0 contour was formed using the steps described in [?], which results in an approximation of the original F0 contour consisting of third-order polynomial segments. As input to the neural network, a set of textual features was extracted from the raw text. Table 1 lists the features considered in this work. The phoneme set consists of 30 consonants and 16 vowels. A maximum syllable length of 6 phonemes and a maximum of 10 syllables per word were considered, which is sufficient for the Bengali language. These features were re-encoded using one-of-N codes, which resulted in 220 binary features. The phoneme and syllable properties may differ from language to language. To mitigate this language dependency, only the language-specific text analysis module needs to be adapted. Apart from the text analysis module, the whole system is completely independent of language.


Table 1. Textual features extracted from raw text

Feature Name                                           No. of Features
Phoneme identity [Previous/Current/Next]               46 * 3 = 138
No. of syllables in current word                       10
Phoneme position in syllable [Forward/Backward]        6 * 2 = 12
Syllable position in word [Forward/Backward]           10 * 2 = 20
No. of phonemes in syllable [Previous/Current/Next]    6 * 3 = 18
Vowel position in syllable [Forward]                   6
Vowel identity in the syllable                         16
Total                                                  220
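
To make the one-of-N encoding and the total of 220 inputs concrete, the sketch below concatenates one binary code per feature field, with field sizes taken from Table 1. The field names, the 0-based category indices and the choice to leave a field all-zero when it does not apply (for example, no previous phoneme) are assumptions made for illustration only.

```python
# (field name, number of categories); the sizes follow Table 1.
FIELDS = [
    ("prev_phoneme", 46), ("cur_phoneme", 46), ("next_phoneme", 46),
    ("syllables_in_word", 10),
    ("phoneme_pos_in_syllable_fwd", 6), ("phoneme_pos_in_syllable_bwd", 6),
    ("syllable_pos_in_word_fwd", 10), ("syllable_pos_in_word_bwd", 10),
    ("phonemes_in_prev_syllable", 6), ("phonemes_in_cur_syllable", 6),
    ("phonemes_in_next_syllable", 6),
    ("vowel_pos_in_syllable", 6),
    ("vowel_identity", 16),
]

INPUT_DIM = sum(size for _, size in FIELDS)  # 138 + 10 + 12 + 20 + 18 + 6 + 16

def encode(values):
    """Build the binary input vector for one phoneme.

    `values` maps a field name to a 0-based category index; a missing or
    None entry leaves that field all-zero (an assumption of this sketch).
    """
    vec = []
    for name, size in FIELDS:
        code = [0] * size
        idx = values.get(name)
        if idx is not None:
            if not 0 <= idx < size:
                raise ValueError(f"{name}: index {idx} outside 0..{size - 1}")
            code[idx] = 1
        vec.extend(code)
    return vec

# Hypothetical example: current phoneme id 3 in a word of 2 syllables.
x = encode({"cur_phoneme": 3, "syllables_in_word": 1})
```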


3.1. DBN Training 

The input to the DBN is binary, i.e. we used Bernoulli-Bernoulli RBMs. From the corpus, 7000 sentences were chosen to train the DBN. Each DBN layer was pre-trained for 50 epochs as an RBM with a mini-batch size of 10. Average gradients were computed on the mini-batches, and parameters were updated with a learning rate of 0.002 and a momentum of 0.95. The different DBN architectures shown in Figure 3 were trained with this configuration.

3.2. DNN Training 

The DNN is pre-trained by the DBN, which means the weights of the DNN are initialized from the trained DBN. To fine-tune the DNN, 1000 sentences (excluding the previous 7000) were taken from the CRBLP corpus and phonetically aligned with the HTK toolkit (5-state HMMs). Finally, the phonetic boundaries were manually corrected. Of the 1000 sentences, 500 were used for training, 200 for cross-validation and 300 for testing the DNN. The outputs of the DNN are log F0 values corresponding to each phoneme state. This is because the sentences are segmented using 5-state HMMs, which results in 5 states for each phoneme, i.e. each observation corresponds to roughly 1/5 of a phoneme. The F0 value corresponding to each state was calculated from the continuous F0 contour using the duration information generated by HTK. These F0 values act as the output of the DNN. We used mean squared error as the objective function, with a sparsity target of 0.05 and a weight decay of 0.002. Backpropagation training used mini-batches of 100 states and was evaluated on the cross-validation set. The loss was measured over 10 epochs, and if the loss increased, the learning rate was decreased by a factor of two.
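
The learning-rate rule described above can be sketched in a few lines. The text does not state the initial fine-tuning learning rate or exactly which losses are compared, so this sketch assumes that each validation loss is compared with the previous one; the function name and the starting value of 0.1 are hypothetical.

```python
def halve_on_increase(val_losses, initial_lr):
    """Return the learning rate in effect after each validation check.

    The rate is divided by two whenever the validation loss is higher
    than at the previous check (an assumed reading of the rule).
    """
    lr = initial_lr
    previous = None
    schedule = []
    for loss in val_losses:
        if previous is not None and loss > previous:
            lr /= 2.0
        schedule.append(lr)
        previous = loss
    return schedule

# Hypothetical validation losses from five successive checks.
rates = halve_on_increase([1.0, 0.8, 0.9, 0.7, 0.75], initial_lr=0.1)
```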

The experiment was carried out on a Dell Precision T3600 workstation, a 6-core computer with a CPU clock speed of 3.2 GHz, 12 MB of L3 cache and 64 GB of DDR3 RAM. Training also used an NVIDIA Quadro 4000 general-purpose graphics processing unit (GPGPU) for matrix multiplication.



Fig. 1. Training stage of the proposed F0 method


3.3. Synthesis Stage 

Fig. 2. Synthesis stage of the proposed method

Figure 2 shows the synthesis procedure of the proposed system. It has two main parts: generation of the cepstrum and duration, and prediction of the F0 values. At synthesis time, the same features are extracted for each phoneme and fed to the DNN. With the previously trained weights, the DNN then predicts the F0 values. The duration of each state is extracted from the HTS phoneme state-duration model. Cubic spline interpolation is performed on the state F0 values using the duration information so that the resulting F0 contour has the same length as the synthesized utterance. As a result, a continuous pitch contour is generated. Synthesized speech is then generated using the MLSA filter.
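
The expansion of per-state F0 values into a frame-level contour can be illustrated with the NumPy sketch below. Several details are assumptions, since they are not specified in the text: a natural cubic spline (zero second derivative at both ends) is used, each state value is placed at the centre of its state, and frames before the first or after the last centre hold the end value. The function names are hypothetical.

```python
import numpy as np

def natural_cubic_spline(x, y, xq):
    """Evaluate the natural cubic spline through (x, y) at the points xq."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Hold the end values outside the knot range (assumption).
    xq = np.clip(np.asarray(xq, dtype=float), x[0], x[-1])
    n = len(x) - 1
    h = np.diff(x)
    # Linear system for the second derivatives M, with M[0] = M[n] = 0.
    A = np.zeros((n + 1, n + 1))
    rhs = np.zeros(n + 1)
    A[0, 0] = A[n, n] = 1.0
    for i in range(1, n):
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        rhs[i] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    M = np.linalg.solve(A, rhs)
    # Segment index of every query point.
    i = np.clip(np.searchsorted(x, xq, side="right") - 1, 0, n - 1)
    dl, dr = xq - x[i], x[i + 1] - xq
    return ((M[i] * dr**3 + M[i + 1] * dl**3) / (6.0 * h[i])
            + (y[i] / h[i] - M[i] * h[i] / 6.0) * dr
            + (y[i + 1] / h[i] - M[i + 1] * h[i] / 6.0) * dl)

def states_to_contour(state_f0, state_frames):
    """Expand one F0 value per state into one value per frame."""
    ends = np.cumsum(state_frames)
    centres = ends - np.asarray(state_frames) / 2.0  # state midpoints
    frames = np.arange(ends[-1]) + 0.5               # frame midpoints
    return natural_cubic_spline(centres, state_f0, frames)

# Hypothetical log F0 values for five states lasting 4, 6, 5, 3 and 7 frames.
contour = states_to_contour([5.0, 5.1, 5.3, 5.2, 5.0], [4, 6, 5, 3, 7])
```

The contour has one value per frame (25 here), so it can be aligned with the spectral parameters generated for the same state durations.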

4. RESULTS AND EVALUATION 

Objective evaluation is conducted to assess the performance of the different DBN-DNN models. The best model is then evaluated further using subjective tests. To compare the proposed model against the clustering-tree, multi-space probability distribution (MSD-HMM) F0 model included in the HTS synthesis engine toolkit [?], a Bengali HTS system [?] is constructed. It is built with HTS 2.2 using 34th-order MGC coefficients, a 10 ms Blackman window, a 5 ms shift and a frequency warping factor of 0.53.

Fig. 3. RMSE and XCORR of the predicted F0 on the test set
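
Fig. 3 reports two objective metrics, root mean square error (RMSE) and cross-correlation (XCORR). The exact formulas are not given in the text, so the sketch below is only one plausible reading: it assumes that XCORR denotes the Pearson correlation coefficient between predicted and reference F0 values, and the function names are hypothetical.

```python
import math

def rmse(pred, ref):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def xcorr(pred, ref):
    """Pearson correlation coefficient (assumed meaning of XCORR)."""
    n = len(ref)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)

# Hypothetical F0 values in Hz for four frames.
predicted = [118.0, 125.0, 131.0, 127.0]
reference = [120.0, 124.0, 135.0, 126.0]
print(rmse(predicted, reference), xcorr(predicted, reference))
```

A lower RMSE means the predicted values are closer to the reference in absolute terms, while a higher XCORR means the predicted contour follows the shape of the reference more closely.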


4.1. Objective Evaluation and Model Selection

For the objective evaluation, two metrics were calculated: cross-correlation (XCORR) and root mean square error (RMSE). Several DNN architectures were constructed and their performance was evaluated using these two metrics. Figure 3 shows the RMSE and XCORR values of the predicted F0 on the test set (300 sentences). From Figure 3 it is clear that XCORR improves on the test set with increasing hidden layer size, but the performance in terms of RMSE decreases. We therefore chose DBN-DNN (120U-7L), which yields the best performance according to the two metrics (RMSE 17 and XCORR 0.64). Table 2 presents the objective test results on the test set for MSD-HMM and DBN-DNN. DBN-DNN (120U-7L) is selected for subjective evaluation.

Table 2. Result of objective test between MSD-HMM and DBN-DNN

         MSD-HMM   DBN-DNN
RMSE     25.03     17
XCORR    0.49      0.64

4.2. Subjective Evaluation

For subjective measurement, an ABX test was performed with 5 subjects (3 male, 2 female). None of the subjects are speech experts, and all are native speakers of Bengali. In this experiment, the 50 test sentences were synthesized using DBN-DNN (120U-7L) and MSD-HMM, and participants were asked to choose the one they preferred. Figure 4 shows that 82% of the time subjects had a preference for one of the two systems, and the majority (54% vs 46%) preferred the proposed system; the difference is statistically significant with p < 0.001.


Fig. 4. ABX scores of the two systems (legend: MSD-HMM, DBN-DNN, same preference)
5. CONCLUSION & FUTURE WORKS 




In this work we have applied a Deep Belief Network (DBN) to model the F0 contour of synthesized speech generated by an HMM-based speech synthesis system. The DBN acts as a high-level feature extractor from the raw input text. The neural network is trained on per-phoneme properties: textual features (phoneme identity, syllable counts, etc.) are used as input, and normalized log F0 values are used as output. Although the whole experiment was conducted on the Bengali language, the approach can be applied to any language. The objective metrics and the subjective test show that the proposed model is preferred over the MSD-HMM-based model found in many standard text-to-speech synthesis systems.